Day 2: NLP Fundamentals and Terminology
To understand LLMs, you first need to know NLP (Natural Language Processing) terminology. Today we’ll organize the NLP terms that appear most frequently in LLM papers and documentation.
NLP Core Terminology
| Term | Description | Example |
|---|---|---|
| Token | The smallest processing unit of text | "Hello world" -> ["Hello", " world"] |
| Corpus | A collection of text data used for training | All of Wikipedia, news article collections |
| Vocabulary | The set of all tokens the model knows | GPT-4’s vocabulary size: ~100,000 tokens |
| Embedding | Words converted into numeric vectors | "king" -> [0.2, -0.5, 0.8, …] |
| Sequence | An ordered arrangement of tokens | A single sentence or paragraph |
| Attention | A mechanism that focuses on important parts of the input | In “He ate the apple,” determining who “he” refers to |
| Encoding | Converting input into internal representation | Sentence -> vector |
| Decoding | Converting internal representation into output | Vector -> sentence |
| Perplexity | A metric of model prediction uncertainty (lower is better) | PPL=15: on average, the model is as uncertain as choosing among ~15 equally likely next words |
| Context Window | The number of tokens a model can process at once | Modern models support tens to hundreds of thousands of tokens |
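Three of the terms above (Vocabulary, Encoding, Decoding) fit together neatly, which a toy sketch can show. The four-entry vocabulary below is entirely made up; real models map tokens to integer IDs in the same spirit, just at a much larger scale:

```python
# Toy vocabulary: maps each known token to an integer ID
vocab = {"Hello": 0, "world": 1, "is": 2, "big": 3}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(tokens):
    """Encoding: tokens -> internal representation (here, just IDs)."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Decoding: internal representation -> tokens."""
    return [inverse_vocab[i] for i in ids]

ids = encode(["Hello", "world"])
print(ids)          # [0, 1]
print(decode(ids))  # ['Hello', 'world']
```

In a real LLM the "internal representation" is a vector per token rather than a bare ID, but the round trip (encode, process, decode) is the same.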
Basic Tokenization Concept
```python
# Simplest tokenization: splitting by whitespace
sentence = "Natural language processing is really fascinating"
tokens_simple = sentence.split()
print(tokens_simple)
# ['Natural', 'language', 'processing', 'is', 'really', 'fascinating']

# Real LLMs use subword tokenization:
# "fascinating" -> ["fasc", "inating"], i.e. split into smaller pieces
```
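How does a subword split like that come about? Real tokenizers learn their subword vocabulary from data (with algorithms such as BPE), but the lookup step can be sketched with a simplified greedy longest-match over a hand-picked, hypothetical subword vocabulary:

```python
# Hypothetical subword vocabulary (real tokenizers learn this from a corpus)
subwords = {"fasc", "inating"}

def subword_tokenize(word, vocab):
    """Greedy longest-match: repeatedly take the longest known piece."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to itself
            i += 1
    return tokens

print(subword_tokenize("fascinating", subwords))
# ['fasc', 'inating']
```

The key property this preserves from real tokenizers: any word can be tokenized, because rare or unseen words fall back to smaller pieces instead of failing.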
Intuitive Understanding of Embeddings
```python
import numpy as np

# Embeddings: representing words as numeric vectors.
# Words with similar meanings lie close together in vector space.
embeddings = {
    "king": np.array([0.8, 0.2, -0.5, 0.9]),
    "queen": np.array([0.7, 0.3, -0.4, 0.85]),
    "apple": np.array([-0.2, 0.9, 0.6, -0.1]),
}

# Measure similarity between words using cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen similarity: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king-apple similarity: {cosine_similarity(embeddings['king'], embeddings['apple']):.3f}")
# king-queen: high similarity / king-apple: low similarity
```
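Because embeddings are ordinary vectors, you can also do arithmetic on them; the famous "king - man + woman ≈ queen" analogy is the classic illustration. The vectors for "man" and "woman" below are made up (and chosen so the analogy works exactly, which real embeddings only approximate):

```python
import numpy as np

# Toy vectors; the numbers are invented so the analogy lands exactly
embeddings = {
    "king":  np.array([0.8, 0.2, -0.5, 0.9]),
    "queen": np.array([0.7, 0.3, -0.4, 0.85]),
    "man":   np.array([0.75, 0.15, -0.45, 0.2]),
    "woman": np.array([0.65, 0.25, -0.35, 0.15]),
}

# The classic analogy: king - man + woman ≈ queen
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(result)
print(np.allclose(result, embeddings["queen"]))  # True
```

With real embeddings (Word2Vec, or a modern model's embedding layer) you would look up the nearest neighbor of `result` by cosine similarity and find "queen" near the top, rather than an exact match.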
Perplexity Calculation Example
```python
import numpy as np

# Perplexity: how well a model predicts the next word.
# PPL = exp(average cross-entropy loss)
def calculate_perplexity(loss):
    return np.exp(loss)

good_model_loss = 2.7  # well-trained model
bad_model_loss = 5.5   # poorly trained model

print(f"Good model PPL: {calculate_perplexity(good_model_loss):.1f}")
print(f"Bad model PPL: {calculate_perplexity(bad_model_loss):.1f}")
# Lower PPL means better next-word prediction
```
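Where does that loss number come from? It is the average negative log of the probabilities the model assigned to the tokens that actually occurred. The sketch below uses made-up probabilities, plus a sanity check that connects perplexity back to the "number of candidates" intuition from the table:

```python
import numpy as np

# Probabilities the model assigned to each actual next token (made up)
token_probs = np.array([0.25, 0.10, 0.40, 0.05])

loss = -np.log(token_probs).mean()  # average cross-entropy
ppl = np.exp(loss)
print(f"loss={loss:.3f}, PPL={ppl:.1f}")

# Sanity check: a model that spreads probability uniformly over N tokens
# has perplexity exactly N, hence "PPL ~ number of candidate next words"
uniform = np.full(8, 1 / 8)
print(np.exp(-np.log(uniform).mean()))  # 8.0 (up to float rounding)
```

This is also why PPL=1 (see today's exercises) would mean assigning probability 1.0 to every correct next token.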
NLP terminology cannot be memorized in a single day. Use this table as a reference as we dive deeper into each concept in the days ahead.
Today’s Exercises
- Tokenize the sentence “Artificial intelligence is changing the world” by whitespace, by syllable, and by meaning. Explain the differences.
- Summarize the pros and cons of larger embedding vector dimensions. Compare Word2Vec (300 dimensions) with modern large model embeddings (higher dimensions).
- Think about what it means if Perplexity equals 1, and whether a model with PPL=1 is achievable in practice.